PLSC30500, Fall 2024

Part 2. Summarizing distributions (part a)

Andy Eggers

Expectations

Motivation

Suppose we have a random variable \(X\), e.g.

  • number of heads in 3 coin flips
  • whether a randomly selected voter plans to vote for Brexit
  • income of a randomly selected citizen

We know the PMF/PDF \(f(x)\) and CDF \(F(x)\).

How can we summarize this distribution?

Expected value: discrete case

For discrete R.V. with probability mass function (PMF) \(f(x)\), the expected value of \(X\) is

\[{\textrm E}\,[X] = \sum_x x f(x) \]

Could write

\[{\textrm E}\,[X] = \sum_{x \in \text{Supp}[X]} x f(x) \]

Example

\(x\)   \(f(x)\)
0       .2
1       .5
3       .3

\[ f(x) = \begin{cases} .2 & x = 0 \\ .5 & x = 1 \\ .3 & x = 3 \\ 0 & \text{otherwise} \end{cases} \]

What is \({\textrm E}\,[X] = \sum_x x f(x)\)?

\[\begin{aligned} {\textrm E}\,[X] &= 0 \times .2 + 1 \times .5 + 3 \times .3 \\ &= 1.4 \end{aligned}\]
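The same sum is easy to compute in R from the values and PMF in the table above:

```r
# values and PMF from the table above
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

# expected value: each value weighted by its probability
sum(x * fx)
# [1] 1.4
```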

Same PMF

Expectation vs average

You know about taking the average or mean of a set of numbers \(x_1, x_2, \ldots, x_n\):

\[ \overline{x} = \frac{1}{n} \sum_{i = 1}^n x_i \]

Just as a probability is a long-run frequency, an expectation is a long-run average.

  • \({\textrm E}\,[X]\) summarizes a random variable \(X\); \(\overline{x}\) summarizes a set of numbers
  • if each \(x_i\) is an independent sample from \(X\), \(\overline{x}\) approximates \({\textrm E}\,[X]\) (more closely with larger samples)
  • if each \(x \in \text{Supp}[X]\) appears in the sample with frequency exactly \(f(x)\), then \(\overline{x} = {\textrm E}\,[X]\)
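A quick simulation illustrates the second bullet, using a PMF that puts probabilities .2, .5, .3 on the values 0, 1, 3 (the seed is arbitrary, chosen only for reproducibility):

```r
set.seed(60637)  # arbitrary seed, for reproducibility
# 100,000 independent draws from a PMF putting .2, .5, .3 on 0, 1, 3
draws <- sample(c(0, 1, 3), size = 100000, replace = TRUE,
                prob = c(.2, .5, .3))
mean(draws)  # close to E[X] = 1.4, though not exactly equal
```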

Expectation vs average (2)

Given PMF:

\[ f(x) = \begin{cases} .2 & x = 0 \\ .5 & x = 1 \\ .3 & x = 3 \\ 0 & \text{otherwise} \end{cases} \]

Then \({\textrm E}\,[X] = 0 \times .2 + 1 \times .5 + 3 \times .3 = 1.4\).

Alternative method: make a vector of length \(n\) (here \(n = 10\)) in which each value \(x\) appears \(n f(x)\) times:

x <- c(0, 0, 1, 1, 1, 1, 1, 3, 3, 3)
mean(x)
[1] 1.4

Why does this work?

Expectation vs average (3)

If each unique \(x\) appears \(n f(x)\) times, then

\[\begin{aligned} \overbrace{\frac{1}{n} \sum_i x_i}^{\text{Average}} &= \frac{1}{n} \sum_x x n f(x) \\ &= \frac{n}{n} \sum_x x f(x) = {\textrm E}\,[X] \end{aligned}\]

This may clarify why, in any real sample (where frequencies only approximate \(f(x)\)), \(\overline{x}\) generally differs from \({\textrm E}\,[X]\).

The continuous case

For continuous R.V. \(X\),

\[ {\textrm E}\,[X] = \int_{-\infty}^{\infty} x f(x) \, dx \]
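As a quick numerical sketch of this integral, take the exponential distribution with rate 1, whose expected value is known to be 1:

```r
# E[X] = integral of x f(x) dx; here f is the Exponential(1) density
integrate(function(x) x * dexp(x, rate = 1), lower = 0, upper = Inf)$value
# approximately 1, the known mean of the Exponential(1) distribution
```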

\({\textrm E}\,[X]\) as a “good guess”

For R.V. \(X\), consider \((X - c)^2\) for some constant \(c\). (A function of a random variable.)

Define mean squared error of \(X\) about \(c\) as \({\textrm E}\,[(X - c)^2]\).

For \(c=1\), we have:

\(x\)   \(f(x)\)   \((x - 1)^2\)
0       .2         1
1       .5         0
3       .3         4

So MSE of \(X\) about \(1\) is:

\[ .2 \times 1 + .5 \times 0 + .3 \times 4 = 1.4 \]

\({\textrm E}\,[X]\) is the choice of \(c\) that minimizes the MSE. (Proof to come.)
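A numerical sketch of this claim, searching over a grid of candidate values of \(c\) for the example PMF above:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

# mean squared error of X about a constant c
mse <- function(c) sum((x - c)^2 * fx)

# evaluate the MSE over a grid of candidate c values
cands <- seq(0, 3, by = .01)
cands[which.min(sapply(cands, mse))]
# [1] 1.4  -- the expected value
```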

Bonus: \({\textrm E}\,[X]\) as X’s center of mass

Suppose we place a weight \(f(x)\) at each value \(x \in \text{Supp}(X)\) along a weightless rod.

Where is the center of mass, i.e. point where rod balances?

It is the point \(c\) where \(\sum_x (x - c) f(x) = 0\).

That point is \({\textrm E}\,[X]\).

Proof:

\[\begin{aligned} \sum_x (x - E[X]) f(x) &= \sum_x \left( x f(x) - E[X] f(x) \right) \\ &= \sum_x x f(x) - \sum_x E[X] f(x) \\ &= E[X] - E[X] \sum_x f(x) \\ &= E[X] - E[X] \\ &= 0 \end{aligned}\]
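We can verify the balance condition numerically for the example PMF used earlier:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)
EX <- sum(x * fx)  # 1.4

# the weighted deviations about E[X] sum to zero
# (up to floating-point error)
sum((x - EX) * fx)
```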

Expectation of a function of two RVs

Consider this joint PMF \(f(x, y)\) for \(X\) and \(Y\) (e.g. state 1 militarizes, state 2 militarizes)

\(x\)   \(y\)   \(f(x,y)\)
0       0       1/10
0       1       1/5
1       0       1/5
1       1       1/2

\[ f(x,y) = \begin{cases} 1/10 & x = 0, y = 0 \\ 1/5 & x = 0, y = 1 \\ 1/5 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]

What is \({\textrm E}\,[XY]\)? (We will need this, e.g., for covariance.)

\[\begin{aligned} {\textrm E}\,[XY] &\equiv \sum_x \sum_y xy f(x, y) \\ &= 0 \times 1/10 + 0 \times 1/5 + 0 \times 1/5 + 1 \times 1/2 \\ &= 1/2 \end{aligned}\]
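Computing the same double sum in R from the joint PMF table:

```r
# joint PMF from the table above, one row per (x, y) pair
x   <- c(0, 0, 1, 1)
y   <- c(0, 1, 0, 1)
fxy <- c(1/10, 1/5, 1/5, 1/2)

sum(x * y * fxy)
# [1] 0.5
```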

Linearity of expectations

Let \(X\) and \(Y\) be RVs. Then \(\forall a, b, c \in \mathbb{R}\), \({\textrm E}\,[aX + bY + c] = a{\textrm E}\,[X] + b{\textrm E}\,[Y] + c\)

Proof (discrete case):

\[\begin{align} {\textrm E}\,[aX + bY + c] &= \sum_x \sum_y (ax + by + c) f(x,y) \\ &= \sum_x \sum_y ax f(x,y) + \sum_x \sum_y by f(x,y) + \sum_x \sum_y c f(x,y) \\ &= a \sum_x \sum_y x f(x,y) + b \sum_x \sum_y y f(x,y) + c \sum_x \sum_y f(x,y) \\ &= a \sum_x x \sum_y f(x,y) + b \sum_y y \sum_x f(x,y) + c \sum_x \sum_y f(x,y) \\ &= a \sum_x x f_X(x) + b \sum_y y f_Y(y) + c \sum_x f_X(x)\\ &= a {\textrm E}\,[X] + b {\textrm E}\,[Y] + c \end{align}\]
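A numerical check of linearity, using the joint PMF from the militarization example and arbitrarily chosen constants:

```r
# joint PMF from the militarization example, one row per (x, y) pair
x   <- c(0, 0, 1, 1)
y   <- c(0, 1, 0, 1)
fxy <- c(1/10, 1/5, 1/5, 1/2)

a <- 2; b <- -3; cc <- 5   # arbitrary constants

# left side: E[aX + bY + c] computed directly from the joint PMF
lhs <- sum((a * x + b * y + cc) * fxy)

# right side: a E[X] + b E[Y] + c, using the marginal expectations
rhs <- a * sum(x * fxy) + b * sum(y * fxy) + cc

all.equal(lhs, rhs)
# [1] TRUE
```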

Example of code interpretation

Consider this code:

samp <- sample(x = c("a", "b", "c"), 
               size = 1000, 
               replace = TRUE, 
               prob = c(.1, .3, .6))
## R help file says: 
# sample takes a sample of the specified size from the elements of x either with or without replacement.
## function arguments: 
# x: either a vector of one or more elements from which to choose, or a positive integer. 
# size: a non-negative integer giving the number of items to choose.
# replace: should sampling be with replacement?
# prob: a vector of probability weights for obtaining the elements of the vector being sampled.
tens <- rep(10, 1000)

What would the output of mean(samp == "a") be (approximately)?

Answer: It should be about .1, the probability of drawing an “a”.

What would the output of sum(tens[samp == "b"]) be (approximately)?

Answer: It should be about \(10 \times 300 = 3000\). (About 300 of the entries in samp should be “b”, so tens[samp == "b"] should be a vector of about 300 10s, and its sum should be about 3000.)

Variance

Variance: definition and example

\[{\textrm V}\,[X] \equiv {\textrm E}\,[(X - {\textrm E}\,[X])^2]\]

For a Bernoulli RV (\(x=1\) could mean heads, or a revolution occurring):

\[ f(x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \\ 0 & \text{otherwise} \end{cases} \]

We can compute \((X - {\textrm E}\,[X])^2\) at each \(x\):

\(x\)   \(f(x)\)                 \((x - {\textrm E}\,[X])^2\)
0       \(\color{green}{1-p}\)   \(\color{red}{p^2}\)
1       \(\color{blue}{p}\)      \(\color{orange}{1 - 2p + p^2}\)

And then variance as \({\textrm E}\,[(X - {\textrm E}\,[X])^2]\):

\[\begin{aligned} {\textrm V}\,[X] &= {\textrm E}\,[(X - {\textrm E}\,[X])^2] \\ &= \color{red}{p^2}\color{green}{(1-p)} + \color{orange}{(1 - 2p + p^2)}\color{blue}{p} \\ &= p^2 - p^3 + p - 2p^2 + p^3 \\ &= p(1 - p)\end{aligned}\]
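Checking the algebra numerically for one (arbitrarily chosen) value of \(p\):

```r
p  <- .3  # an arbitrary Bernoulli parameter
x  <- c(0, 1)
fx <- c(1 - p, p)
EX <- sum(x * fx)  # equals p

# variance from the definition, and the closed form p(1 - p)
c(definition = sum((x - EX)^2 * fx), closed_form = p * (1 - p))
# both equal 0.21
```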

Graphical example

Variance: alternative formulation

Bernoulli example again:

\[ f(x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \\ 0 & \text{otherwise} \end{cases} \]

Alternative formulation for variance:

\[{\textrm V}\,[X] = {\textrm E}\,[X^2] - {\textrm E}\,[X]^2\]

What is \({\textrm E}\,[X]\)? What is \({\textrm E}\,[X^2]\)?

Since \(X\) takes only the values 0 and 1, \(X^2 = X\), so \({\textrm E}\,[X] = {\textrm E}\,[X^2] = p\). By the alternative formula, we then have

\[{\textrm V}\,[X] = p - p^2 = p(1-p)\]

Proof that \({\textrm V}\,[X] = {\textrm E}\,[X^2] - {\textrm E}\,[X]^2\)

\[\begin{align} {\textrm V}\,[X] &= {\textrm E}\,\left[(X - {\textrm E}\,[X])^2\right] \\ &= {\textrm E}\,\left[X^2 - \color{blue}{2{\textrm E}\,[X]} X + {\textrm E}\,[X]^2\right] \\ &= {\textrm E}\,[X^2] - {\textrm E}\,\left[\color{blue}{2{\textrm E}\,[X]} X\right] + {\textrm E}\,\left[{\textrm E}\,[X]^2\right] \\ &= {\textrm E}\,[X^2] - \color{blue}{2{\textrm E}\,[X]} {\textrm E}\,[X] + {\textrm E}\,[X]^2 \\ &= {\textrm E}\,[X^2] - 2 {\textrm E}\,[X]^2 + {\textrm E}\,[X]^2 \\ &= {\textrm E}\,[X^2] - {\textrm E}\,[X]^2 \end{align}\]
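The two formulas agree numerically for the three-point PMF used earlier:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

EX  <- sum(x * fx)     # E[X] = 1.4
EX2 <- sum(x^2 * fx)   # E[X^2] = 3.2

# definition vs. alternative formula -- both give 1.24
c(definition = sum((x - EX)^2 * fx), alternative = EX2 - EX^2)
```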

Properties of variance

For a random variable \(X\),

  • \(\forall c \in \mathbb{R}\), \({\textrm V}\,[X + c] = {\textrm V}\,[X]\)
  • \(\forall a \in \mathbb{R}\), \({\textrm V}\,[aX] = a^2{\textrm V}\,[X]\)

Proof of first point:

\[\begin{align} {\textrm V}\,[X + c] &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X + c])^2\right] \\ &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X] - {\textrm E}\,[c])^2\right] \\ &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X] - c)^2\right] \\ &= {\textrm E}\,\left[(X - {\textrm E}\,[X])^2\right] \\ &= {\textrm V}\,[X] \end{align}\]
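Both properties can be checked exactly from a PMF; here is a sketch using the example distribution from earlier, with arbitrary constants:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

# variance of a transformed variable, computed exactly from the PMF
V <- function(z) sum((z - sum(z * fx))^2 * fx)

a <- 2; cc <- 10  # arbitrary constants
all.equal(V(x + cc), V(x))       # TRUE: shifting by a constant changes nothing
all.equal(V(a * x), a^2 * V(x))  # TRUE: scaling multiplies variance by a^2
```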

Standard deviation

Standard deviation:

\[\sigma[X] = \sqrt{{\textrm V}\,[X]}\]

Roughly: “how far does \(X\) tend to be from its mean?”

Revisiting justification for \({\textrm E}\,[X]\)

Alternative formula for MSE:

\[\begin{align} {\textrm E}\,[(X - c)^2] &= {\textrm E}\,\left[X^2 - 2cX + c^2\right] \\ &= {\textrm E}\,[X^2] - 2c{\textrm E}\,[X] + c^2 \\ &= {\textrm E}\,[X^2] - \color{red}{{\textrm E}\,[X]^2} + \color{green}{{\textrm E}\,[X]^2} - 2c{\textrm E}\,[X] + c^2 \\ &= \left({\textrm E}\,[X^2] - \color{red}{{\textrm E}\,[X]^2}\right) + \left(\color{green}{{\textrm E}\,[X]^2} - 2c{\textrm E}\,[X] + c^2\right) \\ &= {\textrm V}\,[X] + \left({\textrm E}\,[X] - c\right)^2 \end{align}\]

So what \(c\) should you choose to minimize MSE?

Parametric distributions

Given any RV \(X\) with associated PMF/PDF or CDF, we can compute \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\).

For special types of RV, these parameters define the distribution:

  • Bernoulli distribution identified by \({\textrm E}\,[X]\) (\(p\))
  • Normal distribution identified by \({\textrm E}\,[X]\), \({\textrm V}\,[X]\) (\(\mu\), \(\sigma^2\))

But don’t get confused: any RV has \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\), not just these special ones.

Other things not to get confused about

Sometimes “mean” means \({\textrm E}\,[X]\) (e.g. mean squared error or \(\mu\) of a normal distribution), sometimes it means “sample mean” or “average” of some numbers (e.g. mean(c(2,4,6))).

Sometimes “variance” means \({\textrm V}\,[X]\), sometimes it means “sample variance” (e.g. var(c(2,4,6))).

There is a close relationship, but remember that

  • \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\) are operators that convert a distribution (PMF, PDF, CDF) into a number, and
  • mean() and var() are R functions that convert a vector of numbers (e.g. a sample) into a number.
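One more wrinkle worth knowing: R's var() computes the sample variance with an \(n - 1\) denominator, so it is not the same as the average squared deviation:

```r
x <- c(2, 4, 6)

mean(x)                # sample mean: 4
var(x)                 # sample variance, with n - 1 denominator: 4
mean((x - mean(x))^2)  # average squared deviation, with n denominator: 8/3
```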